Data Visualization


Data can be represented simply with letters and numbers. In fact, this is the most precise and direct way of doing so. The problem arises when the data scales up and becomes too numerous and overwhelming for human comprehension. Numbers and letters then turn into a "sea of text", and it becomes difficult to distinguish individual values and to judge their weight relative to other parts of the data.
Humans understand information better when it is visualized. This is what this report aims to argue: information visualization not only improves the aesthetic experience but also makes large amounts of quantitative data easier for people to comprehend.

This written report discusses and documents the use of the Data-Driven Documents (D3) JavaScript library to enable dynamic data visualizations on any website. This report does not include a tutorial on how to use the library; instead, it shows how to integrate the tool in order to effectively visualize and easily digest large amounts of data.

Data Visualization Tool
As brief background, D3 is a data visualization tool: a collection of JavaScript code packaged as a library. It is a very flexible library that can be applied to any website, although it does require a good amount of time to learn. D3 was picked as the tool for several reasons: it can handle dynamic data (for example, JSON produced by PHP or loaded with jQuery), and it can be developed and deployed anywhere, with no particular software or operating system requirements. Aside from its very useful functionality, its visualizations include realistic animations. For a full discussion of the tool, see the Lab 2 report on D3 in the lower portion of this web page. This report specifically made use of the Force-Directed Graph layout of the D3 library, which is illustrated like this:
Screen_shot_2011-12-01_at_6.49.27_PM.png


Data being represented
The data being represented comes from a Fourthdraft database. Fourthdraft is an online portfolio system that lets users create their own portfolio online. It has a simple built-in CMS that allows you to flexibly upload, organize and present your portfolio. The site features several ways to collect feedback and information about your visitors and basic site activity. For instance, the site has a "Simple Site Statistics" feature that tracks the number of page views, which pages (albums) were viewed, visits, etc. Aside from these quantitative tracking values, the developer figured out a way to collect not only how many people are visiting but also what people are saying about a certain portfolio on each visit. The site collects relevant data about each visit with regard to its intention.
Since the values are user-generated, it can be safely assumed that the database can grow significantly. This presents a problem: once there are over a hundred entries of user input, it becomes difficult to manage, filter, and make sense of all the database entries manually.

Preparation
A sample Fourthdraft user portfolio was customized to receive text input for each album. Viewers are encouraged to describe each album in one word and enter it in the text field. The input is then saved in the database for future retrieval and visualization. The single database contains all 'keyword description' entries, each linked to the relevant user account of the website. Reading the raw database would be extremely hard, since each row can hold a different keyword for a different album or creative work belonging to a different user. Attached is a snapshot of the database:
Screen_shot_2011-12-01_at_6.27.31_PM.png

Process
Given the tool and the information in the database, the force-directed graph was applied to the data. The representation portrays the username as the center node. The first-degree connections are the albums that the user has. The second-degree connections would normally be the 'creative works' contained in each album. For demo purposes, though, this level is omitted; instead, the second-degree connections attached to each album are the keywords that viewers entered to describe it. After application, the web page contains a JavaScript-rendered object that looks like this:
Screen_shot_2011-12-01_at_7.13.05_PM.png
Do take into consideration the aesthetic improvements this diagram still requires, since the image shows a 'rough draft' of the interface. This visualization aggregates all the albums of one user: it automatically retrieves every album tied to that user along with the keywords that were entered for each album. With this we can understand, at first glance, the aggregated information in the database.
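
The user, album and keyword hierarchy described above can be sketched as a small data transformation. This is only an illustration and not the code behind the screenshot: the row fields (`user`, `album`, `keyword`), the sample values and the `buildGraph` helper are all assumptions. The output is a node list plus index-based links, a common input shape for force-directed layouts.

```javascript
// Hypothetical rows, shaped like the keyword table described above.
// The field names (user, album, keyword) are assumptions for illustration.
const rows = [
  { user: "alice", album: "Sketches", keyword: "bold" },
  { user: "alice", album: "Sketches", keyword: "playful" },
  { user: "alice", album: "Photos", keyword: "bold" },
  { user: "bob", album: "Logos", keyword: "clean" }
];

// Build { nodes, links } for one user:
// user (center) -> albums (first degree) -> keywords (second degree).
function buildGraph(allRows, username) {
  const nodes = [];
  const links = [];
  const indexByKey = new Map();

  // Return the index of a node, creating the node on first use.
  function nodeIndex(key, name, type) {
    if (!indexByKey.has(key)) {
      indexByKey.set(key, nodes.length);
      nodes.push({ name: name, type: type });
    }
    return indexByKey.get(key);
  }

  const center = nodeIndex("user:" + username, username, "user");

  allRows
    .filter(function (row) { return row.user === username; })
    .forEach(function (row) {
      const albumKey = "album:" + row.album;
      const isNewAlbum = !indexByKey.has(albumKey);
      const album = nodeIndex(albumKey, row.album, "album");
      if (isNewAlbum) {
        links.push({ source: center, target: album });
      }

      // Keywords are kept per album, so "bold" on two albums yields two nodes.
      const keywordKey = albumKey + "/keyword:" + row.keyword;
      const isNewKeyword = !indexByKey.has(keywordKey);
      const keyword = nodeIndex(keywordKey, row.keyword, "keyword");
      if (isNewKeyword) {
        links.push({ source: album, target: keyword });
      }
    });

  return { nodes: nodes, links: links };
}

const graph = buildGraph(rows, "alice");
console.log(graph.nodes.length + " nodes, " + graph.links.length + " links");
```

Keeping keyword nodes separate per album is a design choice made here so that each album keeps its own cluster; sharing one node per keyword across albums would instead highlight words that recur.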

If we again consider the second image–the snapshot of the database entries–and include more entries, it would possibly display like this:
duplicated.jpg

If we were to consider this dataset, it would be difficult to decipher what the set is trying to convey. In a practical setting, Fourthdraft could have simply given users access to the main database and let them run their own queries to find the results that pertain to their account. That would not be a viable solution, though, both for basic security and organizational reasons and for human interface and interaction reasons. It would be impractical to expect web users to go through the hassle of 'digging' through information in order to get the details they require. Even if a user got past that difficulty, it would be hard to quickly comprehend the entries that pertain to that user. It would also be difficult to tally the entries even after locating all the rows for a user. Overall, it would be difficult for a user to get the general picture, the general idea perhaps, of what the information is trying to convey. It would also be difficult to relate one value to another in a standard, unstructured table format.
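
To make the tallying problem concrete, here is a minimal sketch of the bookkeeping a reader would otherwise have to do by eye. The row shape (`user`, `album`, `keyword`), the sample entries and the `tallyKeywords` helper are hypothetical and only stand in for the real table.

```javascript
// Hypothetical entries; the field names are assumptions for illustration.
const entries = [
  { user: "alice", album: "Sketches", keyword: "bold" },
  { user: "bob", album: "Logos", keyword: "clean" },
  { user: "alice", album: "Sketches", keyword: "bold" },
  { user: "alice", album: "Photos", keyword: "calm" },
  { user: "alice", album: "Sketches", keyword: "playful" }
];

// Count how often each keyword was entered for each album of one user.
// Returns a Map of album name -> Map of keyword -> count.
function tallyKeywords(allRows, username) {
  const byAlbum = new Map();
  allRows.forEach(function (row) {
    if (row.user !== username) {
      return; // skip rows that belong to other users
    }
    const counts = byAlbum.get(row.album) || new Map();
    counts.set(row.keyword, (counts.get(row.keyword) || 0) + 1);
    byAlbum.set(row.album, counts);
  });
  return byAlbum;
}

const tally = tallyKeywords(entries, "alice");
tally.forEach(function (counts, album) {
  counts.forEach(function (count, keyword) {
    console.log(album + ": " + keyword + " x" + count);
  });
});
```

Even with the counts computed, the result is still a list of text, which is the gap the visualization is meant to fill.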

Data visualization becomes extremely helpful in contexts like these. Not only does it make the data more aesthetically pleasing, it also makes the data 'easily digestible', so the user can more easily grasp the idea behind it.

"Visualization can support knowledge management by facilitating
knowledge sharing and knowledge creation. Knowledge itself is difficult
to visualize because it often exists only in someone’s mind (referred to as
tacit knowledge) (Nonaka, 1994). Visualization can accelerate internalization
by presenting information in an appropriate format or structure
or by helping users find, relate, and consolidate information (and thus
helping to form knowledge) (C. Chen & Paul, 2001; Cohen, Maglio, &
Barrett, 1998; Foner, 1997; Vivacqua, 1999)" (Zhu, 171)

Works Cited
"Information Visualization." Proquest. N.p., n.d. Web. 1 Dec. 2011. <ezproxy.qa.proquest.com.myaccess.library.utoronto.ca/docview/202735259?accountid=14771>.
Jakobsen, Thomas. "Advanced Character Physics." Internet Archive: Wayback Machine. N.p., n.d. Web. 1 Dec. 2011. <http://web.archive.org/web/20080410171619/http:www.teknikus.dk/tj/gdc2001.htm>.
Zhu, Bin. "Information Visualization." Information Science and Technology 39.1 (2006): n. pag. Serials Solutions 360Link. Web. 1 Dec. 2011.




Lab Reports


Lab 1 - R Studio


Screen_shot_2011-11-16_at_4.07.27_PM.png
*Fig 1: R Studio screenshot : Mac : Default sample files (hotdogs)

First Impressions

My experience with R started with the guidelines given by the professor. He provided fairly detailed and informative documentation of the program and its GUI, but afterwards I grew concerned: "Does this mean R is a difficult tool to use? If he is giving such detailed, simple steps, then I'm under the impression that R is neither easy to use nor easy to figure out." I downloaded, installed and configured everything anyway.

Inside the Software

The UI seems familiar. It looks like the development environments for Visual Basic, Eclipse, or Apple's Cocoa/Objective-C. R is also driven by code, so I expected its syntax and semantics to resemble the basic programming languages I already know. Getting the program to work was easy, perhaps thanks to the documentation, instructions and guidelines the professor provided. Being able to run it without errors on the first try was briefly reassuring.

"Houston, We Have a Problem"

Loading the preset samples and following the guidelines went flawlessly. The next step was to edit the datasets, and possibly the aesthetics as well, without causing any errors. At this point I was trying to learn the programming language that comes with R. Unfortunately, I ran into several problems:

1. "Set as Working Directory"

For R to work, you have to set which directory in your filesystem you are working in. All linking, file access, references, etc. will be relative to this location (otherwise you will see the error on the lower left-hand side of Fig 1). Luckily, it is rather easy to set a folder as the working directory:
Screen-Shot-2011-10-06-at-7.06.41-AM.png
Unfortunately (and I don't know whether this is a problem with my machine), this setting only lasts for the current session. That means the next time you open R Studio you need to set the working directory again. I find this tedious.

2. Editing Data (and learning the code in the long run)

Screen_shot_2011-11-16_at_4.43.26_PM.png
Fig 2: Same opened file, different lines of codes

Using the same sample file, I tried editing familiar variables. In Fig 2, I changed the initial value of "USA" (line 20) to "Japan". I made sure this country was actually part of the visualization's data rather than picking an arbitrary country. I also changed "#000000" to "#ff9933" at one point, since it was obvious to me that this is a hexadecimal color code and that it sets the color of the bar graph.

In both separate attempts, the program did not behave as expected. When I specified orange (#ff9933), the bars still came out red, as in the previous (default) run. When I specified Japan, it still showed the USA's data. Basically, it was showing the same graph. There were no errors; the cause was probably one of the following:
  • The application is not updating the preview
  • I was editing the wrong file (verified, not the case)
  • My coding is wrong / there is another procedure I'm missing (verified, not the case)

I tried several troubleshooting approaches, but none worked. In some cases the script would run but show the old data. In other cases it would not run at all because of an error whose cause I could not identify (even when I had not edited the code).

3. Findings

I spent a good amount of time trying to understand the code and debug the errors I had caused. It came close to an hour, and when I got home I still tried to figure it out, without any luck. That is a lot of time just to figure out one error, and it did not even cover the entirety of the code. I concluded that R is not the most viable way to visualize information (at least for me). There is a learning curve you have to address: the programming is one part, the user interface is another. If I had a lot of spare time, I would guess that this would be a very powerful tool to use. But I have limited time, and in the grand scheme of things this is just a tool for visualizing information, not for creating or providing it. It's just an aesthetic tool. If someone is handling huge amounts of data or information, I would assume that their task is highly important and tedious. Hence, it would not be an option for them to dedicate more time (which they don't have) to learning something new for such a small return. This is my case.

R has been recommended by our professor and by a couple of other people, so there must be something to it. Unfortunately, I haven't had the chance to experience this first hand.



Lab 2 - Data-Driven Documents

Screen_shot_2011-11-16_at_7.19.39_PM.png


D3 (Data-Driven Documents) is a very flexible, code-based information visualization tool. There is little to no user interface with this tool since, as the screenshot above states, "it is a small free Javascript library for manipulating documents based on data". One must know how to code in JavaScript in order to use it, and given the nature of JavaScript, most of its use is web-based.

I initially picked Tableau Public as my tool, but one of its limitations is that it only runs on Windows (I use a Mac). I am not trying to promote Macs and I have no problem working with PCs, but that would mean I would have to look for a PC (which I don't have) every time I have to create or edit files. I dropped this idea and then picked Impure.
Impure is Flash-based and can be accessed and presented over the web. While it is portable on the web, which I really preferred, it has certain limitations because it runs on Flash: not all devices support it. Both are good tools, but they have limitations that I was concerned about.

Criteria
Personally, I was looking for a tool I can really take advantage of; something I can use in the future. I'm a web designer / developer and this lab can actually extend my skills, knowledge and toolset. With that, I was looking for a visualization tool that is:
  • portable; it gives me the ability to work online (not tied to installed software) and allows users to view the product online as well
  • platform-independent; this allows me to work on any machine and assures that the result works for all my viewers
  • code-based; I didn't want a UI-based tool because that would mean it has to run as an application, which brings back the platform problem from the second point
  • able to handle dynamic data; Tableau and Impure are good, but their data relies on a static file or document stored somewhere online or offline. I need a visualizer that can be fed dynamic data such as CSV, XML or even an online database like MySQL

And then I discovered D3 (through my professor). It has everything I want. It comes with a bit of a learning curve, but I do not mind, because any time I invest in this tool will benefit me in the future.
*additional features: D3 is animated and it offers several unique data visualization types beyond basic bar graphs and the like.

The learning curve really was the most difficult thing about this tool, but after that it is smooth sailing. There are limited help resources, forums and reviews online (for now) since the tool is relatively new, but the guides you can find online, especially the tutorials and documentation provided on the D3 website itself, are really useful.
Screen_shot_2011-11-16_at_7.41.33_PM.png
*example of their tutorial. It's easy, simple and highly interactive (it animates when you click run)

Static Data
At first, I followed their tutorials and documentation and tried to understand how the library works and how to use it. Having a good understanding of JavaScript was really useful; I highly suggest learning JavaScript before using this tool. The first project I did was a simple bar graph visualization of data contained in a set of variables.

Screen_shot_2011-11-16_at_7.40.32_PM.png
^Output
Screen_shot_2011-11-16_at_7.47.02_PM.png
^Code


As you can see, the result you get is almost directly proportional to the coding effort you put in. It is fairly easy, provided you know the markup and scripting languages involved.
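
Since the code above survives only as a screenshot, here is a minimal text sketch of the same idea: turning a small static array into bar widths. It is not a transcription of the screenshot. The sample values and the helper names (`linearScale`, `barWidths`, `renderBarsHtml`) are assumptions, and it uses plain JavaScript rather than D3 so that it runs without a browser.

```javascript
// Static sample values (made up for illustration).
const staticData = [4, 8, 15, 16, 23, 42];

// Map a value in [0, domainMax] to a pixel width in [0, rangeMax].
function linearScale(domainMax, rangeMax) {
  return function (value) {
    return domainMax === 0 ? 0 : (value / domainMax) * rangeMax;
  };
}

// Compute one rounded pixel width per value; the largest value fills maxWidthPx.
function barWidths(values, maxWidthPx) {
  const scale = linearScale(Math.max.apply(null, values), maxWidthPx);
  return values.map(function (value) { return Math.round(scale(value)); });
}

// Produce one <div> per value, similar in spirit to a div-based bar chart.
function renderBarsHtml(values, maxWidthPx) {
  const widths = barWidths(values, maxWidthPx);
  return values.map(function (value, i) {
    return '<div style="width: ' + widths[i] + 'px;">' + value + '</div>';
  }).join("\n");
}

console.log(renderBarsHtml(staticData, 420));
```

Scaling against the largest value keeps every bar inside a fixed width, which a fixed divisor cannot guarantee once the data changes.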

Dynamic / Live Data
Now, as for my intention: I wanted to use this tool with information fed from a dynamic source. A real-world application of this test was prototyping a feature for Fourthdraft.com. Initially I tried to manipulate and visualize data from the Fourthdraft database. Fourthdraft has a simple statistics feature that keeps count of how many times a portfolio was viewed. I used this data to visualize the number of views (as a bar graph) for each user (one bar per row).
Screen_shot_2011-11-16_at_7.53.30_PM.png
and it worked!

The code is as follows:
<script>
    var livedata = [{"id":"1","pageviews":"4335"},{"id":"3","pageviews":"3538"},{"id":"5","pageviews":"65"},{"id":"7","pageviews":"2"},{"id":"8","pageviews":"789"},{"id":"9","pageviews":"945"},{"id":"10","pageviews":"1184"},{"id":"11","pageviews":"164"},{"id":"12","pageviews":"5"},{"id":"13","pageviews":"9"},{"id":"14","pageviews":"1395"},{"id":"15","pageviews":"197"},{"id":"16","pageviews":"13"},{"id":"17","pageviews":"1247"},{"id":"18","pageviews":"58"},{"id":"19","pageviews":"0"},{"id":"20","pageviews":"26"},{"id":"21","pageviews":"1542"},{"id":"22","pageviews":"6"},{"id":"23","pageviews":"0"},{"id":"24","pageviews":"32"},{"id":"25","pageviews":"405"},{"id":"26","pageviews":"17"},{"id":"27","pageviews":"11"},{"id":"28","pageviews":"2"},{"id":"29","pageviews":"23"},{"id":"30","pageviews":"70"},{"id":"31","pageviews":"0"},{"id":"32","pageviews":"10"},{"id":"33","pageviews":"2053"},{"id":"34","pageviews":"13"},{"id":"35","pageviews":"11"},{"id":"36","pageviews":"5"},{"id":"37","pageviews":"7"},{"id":"38","pageviews":"0"},{"id":"39","pageviews":"22"},{"id":"40","pageviews":"727"},{"id":"41","pageviews":"0"},{"id":"42","pageviews":"436"},{"id":"43","pageviews":"0"},{"id":"44","pageviews":"9"},{"id":"45","pageviews":"34"},{"id":"46","pageviews":"28"},{"id":"47","pageviews":"8"},{"id":"48","pageviews":"13"},{"id":"49","pageviews":"2"},{"id":"50","pageviews":"4"},{"id":"51","pageviews":"55"},{"id":"52","pageviews":"29"},{"id":"53","pageviews":"345"},{"id":"54","pageviews":"0"},{"id":"55","pageviews":"57"},{"id":"56","pageviews":"6"},{"id":"57","pageviews":"9"},{"id":"58","pageviews":"16"},{"id":"59","pageviews":"8"},{"id":"60","pageviews":"3"},{"id":"61","pageviews":"3"}];
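    // Sizing values declared in the original snippet; the chart below does not reference them.
    var w = 20;
    var h = 300;

    // Create a container for the chart inside the element with id "livediv".
    var livechart = d3.select("#livediv")
        .append("div")
        .attr("class", "livechart");

    // Bind one div (one bar) to each row of livedata.
    livechart.selectAll("div")
        .data(livedata)
        .enter().append("div")
        // pageviews arrive as strings; unary + converts them before scaling to a pixel width.
        .style("width", function(d) { return (+d.pageviews / 3) + "px"; })
        .text(function(d) { return d.pageviews; });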
</script>
"var livedata" was dynamically generated through PHP and was retrieved in a MySQL database. The rest is the visualizer.



Conclusion
I personally had fun learning and making use of D3, and I know it will be useful for me in the future. The code is not explained since this documentation is not intended to be a tutorial, but rather a record of my findings about the tool. I find the tool really flexible and useful for the web (and for web developers and designers). There are a lot of possible applications for this tool. As a developer of Fourthdraft, I have a couple of ideas for some applications of it, and it may even be used as a feature of the site. It was really important for me to follow my criteria as stated above for the tool to be useful to me. D3 was the only tool I discovered (thus far) that fits the bill. D3 is completely functional and useful, and it is also pleasing both to develop with and to use. Animations are included in some visualizations with no additional coding needed. Some visualizations are interactive, and you can click and filter through the data. Overall, I recommend using D3 if your intentions and purpose are similar to mine. It is both highly functional and aesthetically appealing.